# Replication files for "What lies behind the returns to schooling: the role of labor market sorting and worker heterogeneity" by Portugal, Reis, Guimarães and Cardoso (Restat)

## Overview

This replication package contains the code needed to reproduce all tables and figures in the paper, along with the outputs (tables, figures, and logs). The source data are not included; the next section explains how to obtain them.

## Data availability and provenance statements

Replication of the paper requires access to the restricted-use data files of Quadros do Pessoal (QP). The authors had legitimate access and permission to use the Quadros do Pessoal data. These data are only available to accredited researchers and cannot be included in the replication package. Researchers interested in accessing the data can follow the procedures described on the website of INE (the Portuguese National Statistical Institute). The link with instructions for access is the following:

https://www.ine.pt/xportal/xmain?xpid=INE&xpgid=ine_serv_clientes&INST=161301979&ine_smenu.boui=161303032&ine_smenu.selected=161303243&xlang=en

## Software and computer environment

The code was run in a Singularity container with 4 CPUs in a High Throughput Computing Linux environment with 32 cores and 512 GB of RAM.

The software used was Stata version 17.0 and Python version 3.10.0. 
Python scripts are invoked from within Stata.  

## Instructions for replicators

The replication package contains two *tar* files. 

- **code.tar** - contains all the code needed for the replication
- **outputs.tar** - contains all the outputs generated by the code (with the exception of *dta* files)

To replicate the results of the paper you need the original data files from INE and the **code.tar** file.

We will assume that the name of the replication folder is "Rep" (use whatever name you want). Save the **code.tar** in the folder and untar it. This should create the following folder structure.

```
Rep
├── ados/
│   ├── external/
│   └── internal/
├── outputs/
│   ├── figures/
│   ├── logs/
│   ├── ster/
│   ├── tables/
│   └── temp/
├── progs/
└── source/

```
The folder **progs** contains all code files, while the folder **ados** contains Stata ado files. All other folders are empty except for the **source** folder, which contains the file "filelist.txt" listing the names of the source files from INE that were used in this work. You must place the source files from INE in the **source** folder and confirm that they have the same names as those listed. If the names of your source files are different, you will have to edit the files "01a_read_data_trab.do" and/or "01b_read_data_emp.do" located in the **progs** folder.
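The file-name check described above can be sketched in Python. This is only an illustration and not part of the replication code: the function name `missing_source_files` is hypothetical, and the sketch assumes that "filelist.txt" holds one file name per line, which the package does not guarantee.

```python
from pathlib import Path


def missing_source_files(source_dir: str, list_name: str = "filelist.txt") -> list[str]:
    """Return the names listed in the file list that are absent from source_dir."""
    source = Path(source_dir)
    listing = source / list_name
    # Assumption: one file name per line; blank lines are ignored.
    lines = listing.read_text(encoding="utf-8").splitlines()
    expected = [line.strip() for line in lines if line.strip()]
    present = {p.name for p in source.iterdir() if p.is_file()}
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # "Rep" is the example folder name used in this README.
    source_dir = Path("Rep") / "source"
    if (source_dir / "filelist.txt").is_file():
        for name in missing_source_files(str(source_dir)):
            print(f"missing or renamed: {name}")
    else:
        print(f"no filelist.txt found in {source_dir}")
```

Any name it reports is either missing from the **source** folder or stored under a different name, in which case the two read-data do files need to be adjusted.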

Next, you need to install the following ado files:

- gtools (by Mauricio Caceres Bravo)
- reghdfe (by Sergio Correia)
- ftools (by Sergio Correia)
- egenmore (by Nick Cox)
- estout (by Ben Jann)
- ivreg2 (by Christopher Baum, Mark Schaffer and Steven Stillman)
- ranktest (by Frank Kleibergen, Mark Schaffer and Frank Windmeijer)

These packages were all installed from the SSC GitHub mirror maintained by the Labor Dynamics Institute (https://github.com/labordynamicsinstitute/ssc-mirror). To make sure you use the same versions as we did, install the packages as they stood in the mirror on 2024-01-15.

Finally, you need to install the following Python package:

- pytwoway (by Thibaut Lamadon and Adam Oppenheimer)

For documentation on the package, see:

https://tlamadon.github.io/pytwoway/index.html 
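Before launching the long-running do files, it may help to confirm that the Python interpreter Stata will call can see the package. The following is a minimal sketch, not part of the replication code: the function name `check_environment` is hypothetical, and treating Python 3.10 as the minimum version is an assumption based only on the version reported above.

```python
import importlib.util
import sys


def check_environment(package: str = "pytwoway",
                      min_version: tuple[int, int] = (3, 10)) -> dict[str, bool]:
    """Report whether the interpreter version and the package look usable."""
    return {
        # Assumption: 3.10 (the version used by the authors) is taken as the minimum.
        "python_ok": sys.version_info[:2] >= min_version,
        # find_spec returns None when a top-level package cannot be located.
        "package_found": importlib.util.find_spec(package) is not None,
    }


if __name__ == "__main__":
    for check, passed in check_environment().items():
        print(f"{check}: {'ok' if passed else 'FAILED'}")
```

Run it with the same interpreter that Stata is configured to use, since a package installed for a different Python installation will not be found.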

Once everything is installed, change to the **progs** folder and run the do file "00_master.do". If the replication is successful, you should obtain the same outputs as those contained in **outputs.tar** (with the exception of temporary files, which appear in the "/Rep/outputs/temp" folder).
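One way to compare a finished run with the reference outputs is to hash the files in both folders. The sketch below is an illustration under stated assumptions, not part of the package: the function names are hypothetical, and it assumes **outputs.tar** has been extracted to a separate folder (here called "reference/outputs", a made-up path). Files under "temp" are skipped. Note that *dta* files are not shipped in **outputs.tar** and will therefore show up as extra, and that log files usually record dates and timings, so they are likely to differ byte for byte even when the results agree.

```python
import hashlib
from pathlib import Path


def file_digests(root: str, skip_dirs: tuple[str, ...] = ("temp",)) -> dict[str, str]:
    """Map each file's relative path to its SHA-256 digest, skipping skip_dirs."""
    base = Path(root)
    digests = {}
    for path in sorted(base.rglob("*")):
        rel = path.relative_to(base)
        # Ignore directories and anything stored below a skipped folder.
        if not path.is_file() or any(part in skip_dirs for part in rel.parts[:-1]):
            continue
        digests[rel.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_outputs(reference: str, replicated: str) -> dict[str, list[str]]:
    """List files that are missing, extra, or different in the replicated run."""
    ref, rep = file_digests(reference), file_digests(replicated)
    return {
        "missing": sorted(set(ref) - set(rep)),
        "extra": sorted(set(rep) - set(ref)),
        "different": sorted(n for n in set(ref) & set(rep) if ref[n] != rep[n]),
    }


if __name__ == "__main__":
    # Both paths are hypothetical; adjust them to your own layout.
    reference, replicated = Path("reference/outputs"), Path("Rep/outputs")
    if reference.is_dir() and replicated.is_dir():
        for status, names in compare_outputs(str(reference), str(replicated)).items():
            print(status, names)
```

Files reported as different are best inspected by hand, since small formatting differences do not necessarily indicate a failed replication.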

## Description of do files

Below is a description of the files contained in the "/Rep/progs" folder:

**00_master.do** runs all do files in sequence.

**01a_read_data_trab.do** imports the original yearly worker files (in SPSS format) and saves them in a common Stata format, keeping only the needed variables. You may have to adjust the INE file names if you obtain data with a different extraction date.

**01b_read_data_emp.do** imports the original firm-level files (in SPSS format) and saves them in a common Stata format, keeping only the needed variables. You may have to adjust the INE file names if you obtain data with a different extraction date.

**02a_create_raw_trab_panel.do** creates the worker-level panel and introduces corrections to the panel variables.

**02b_create_raw_emp_panel.do** creates the firm-level panel and introduces corrections to the panel variables.

**03_final_panel.do** creates the final data file used in the analysis.

**04_descriptive_statistics.do** produces descriptive statistics of the final data set.

**05_tables_1+3.do** produces tables 1 and 3. It also produces the data used for figures 1, 2, and 3.

**05_tables_4+5.do** produces tables 4 and 5 and the data for figure 4.

**05_tables_6+7+8.do** produces tables 6, 7, and 8.

**06_figures.do** produces all figures.

**07_appendix.do** produces all tables in the appendix.

## Execution times

Execution times on our system were:

- 01a_read_data_trab.do: 1,394 seconds (23m14s)
- 01b_read_data_emp.do: 100 seconds (1m40s)
- 02a_create_raw_trab_panel.do: 849 seconds (14m9s)
- 02b_create_raw_emp_panel.do: 87 seconds (1m27s)
- 03_final_panel.do: 7,304 seconds (2h1m44s)
- 04_descriptive_statistics.do: 924 seconds (15m24s)
- 05_tables_1+3.do: 37,357 seconds (10h22m37s)
- 05_tables_4+5.do: 11,654 seconds (3h14m14s)
- 05_tables_6+7+8.do: 13,305 seconds (3h41m45s)
- 06_figures.do: 1,993 seconds (33m13s)
- 07_appendix.do: 26,358 seconds (7h19m18s)

Total running time: 28h8m45s  
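As a quick arithmetic check, the total can be recomputed from the per-file timings in seconds listed above. This is a small sketch with a hypothetical helper `hms`; the numbers are copied from this section.

```python
# Per-file execution times in seconds, copied from the list above.
timings = {
    "01a_read_data_trab.do": 1394,
    "01b_read_data_emp.do": 100,
    "02a_create_raw_trab_panel.do": 849,
    "02b_create_raw_emp_panel.do": 87,
    "03_final_panel.do": 7304,
    "04_descriptive_statistics.do": 924,
    "05_tables_1+3.do": 37357,
    "05_tables_4+5.do": 11654,
    "05_tables_6+7+8.do": 13305,
    "06_figures.do": 1993,
    "07_appendix.do": 26358,
}


def hms(seconds: int) -> str:
    """Format a duration in seconds as e.g. '2h1m44s' (hours omitted when zero)."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h{minutes}m{secs}s" if hours else f"{minutes}m{secs}s"


total = sum(timings.values())
print(f"Total: {total:,} seconds ({hms(total)})")  # Total: 101,325 seconds (28h8m45s)
```

Running times on other systems will differ with hardware and data extraction, so these figures are only a rough guide for planning a replication run.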